class: center, middle, inverse, title-slide

# PLSC30500, Fall 2022

## Week 5. Estimation (1)

---

<!-- conditional expectation before lecture 5. -->
<!-- how does independence ensure that E[Y_i(1) | D_i = 1] = E[Y_i(1) | D_i = 0] -->

# Big picture/motivation

Week 1: Programming kick-start

Week 2: Probability

Week 3: Summarizing distributions

Week 4: Identification and causal estimands

**Today, Week 5**: Estimation (learning from random samples)

---

class: inverse, middle, center

# Sampling

---

# Independent and identically distributed (IID)

**Definition 3.1.1** Let `\(X_1, X_2, \ldots, X_n\)` be random variables with CDFs `\(F_1, F_2, \ldots, F_n\)`, respectively. Let `\(F_A\)` denote the joint CDF of the random variables with indices in the set `\(A\)`. Then `\(X_1, X_2, \ldots, X_n\)` are *independent and identically distributed* if they satisfy the following:

- Mutually independent: `\(\forall A \subseteq \{1, 2, \ldots, n\}, \quad \forall (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n, F_A\left((x_i)_{i \in A}\right) = \prod_{i \in A} F_i(x_i)\)`
- Identically distributed: `\(\forall i, j \in \{1, 2, \ldots, n\}\)` and `\(\forall x \in \mathbb{R}, F_i(x) = F_j(x).\)`

--

Suppose we're doing a survey. `\(X_i\)` is the voting intention (1 if D, 0 if R) of survey respondent `\(i\)`.

--

What would make our survey sampling process iid? What would make it not iid?

???

Make it iid: Random sampling with equal probability of selection and with replacement, or random sampling without replacement from a large population

Make it not iid: Random sampling w/o replacement from a small population; sequential random sampling with probability based on previous draws; sampling where e.g. subjects 1 and 2 are the same person by design, so `\(X_1 = X_2\)`

---

# IID?

- Say we are doing random sampling for a survey. Subject 1 is a Democrat. Subject 2 is a Republican. How can this be IID? i.e. it seems like `\(X_1\)` and `\(X_2\)` are not identically distributed.

--

- Say we are doing random sampling for a survey. Subject 1 and Subject 2,498 are from the same household. How can this be IID? i.e. it seems like `\(X_1\)` and `\(X_{2498}\)` are not independent.

--

Keep in mind that i.i.d. sampling does not imply/require a homogeneous population.

???

A&M imply that it is not IID if two subjects are from the same household, but I think they are confusing IID sampling and IID errors given a model.

---

# The essential role of sampling

"Uncertainty" in frequentist statistics arises because

- our estimands are characteristics of the whole population, but
- we have a finite sample from that population

--

What could the answer be if we observed the whole population?

--

Sometimes we work with surveys. Then the above makes sense.

--

What about when we're studying e.g. OECD countries? Sampling from a *super-population*.

--

A&M endorse Angrist's pragmatic view (footnote 7). (There is also the Bayesian alternative.)

--

<!-- # Sampling and uncertainty -->
<!-- "Uncertainty" in frequentist statistics arises because -->
<!-- - our estimands are characteristics of the whole population, but -->
<!-- - we have a finite sample from that population -->
<!-- -- -->

So we will try to figure out how far our estimate could be from the estimand. This depends on the variance (across samples) of the estimator.

---

class: inverse, middle, center

# Learning from samples

---

# Sample statistic

**Definition 3.2.1** *Sample statistic*

For i.i.d.
random variables `\(X_1, X_2, \ldots, X_n\)`, a *sample statistic* is a function of `\(X_1, X_2, \ldots, X_n\)`: `$$T_{(n)} = h_{(n)}(X_1, X_2, \ldots, X_n)$$` where `\(h_{(n)}: \mathbb{R}^n \rightarrow \mathbb{R}, \forall n \in \mathbb{N}\)`. -- Examples of sample statistics: sample mean, sample variance, sample covariance, regression coefficient --- # Sample mean For i.i.d. random variables `\(X_1, X_2, \ldots, X_n\)`, the *sample mean* is `$$\overline X = \frac{X_1 + X_2 + \ldots + X_n}{n} = \frac{1}{n} \sum_{i = 1}^{n} X_i$$` -- Proof that `\(E[\overline{X}] = E[X]\)` (Theorem 3.2.3): `$$\begin{aligned} E[\overline{X}] &= E\left[\frac{1}{n}(X_1 + X_2 + \ldots + X_n) \right] \\\ &= \frac{1}{n} E\left[X_1 + X_2 + \ldots + X_n \right] \\\ &= \frac{1}{n} \left( E[X_1] + E[X_2] + \ldots + E[X_n] \right) \\\ &= \frac{1}{n} \left( n E[X] \right) \\\ &= E[X] \end{aligned}$$` --- # Sampling variance of the sample mean How much would the sample mean `\(\overline{X}\)` vary across iid random samples? <!-- Imagine: --> <!-- For i.i.d. variables `\(X_1, X_2, \ldots, X_n\)` with finite variance `\(V[X]\)`, the *sampling variance* of `\(\overline{X}\)` is --> <!-- `$$V[\overline{X}] = \frac{V[X]}{n}$$` --> -- Theory: `$$\begin{aligned} V[\overline{X}] &= V\left[\frac{1}{n}(X_1 + X_2 + \ldots + X_n) \right] \\\ &= \frac{1}{n^2} V\left[X_1 + X_2 + \ldots + X_n \right] \\\ &= \frac{1}{n^2} \left( V[X_1] + V[X_2] + \ldots + V[X_n] \right) \\\ &= \frac{1}{n^2} \left( n V[X] \right) \\\ &= \frac{V[X]}{n} \end{aligned}$$` -- Explain in words. --- # Simulation We showed that `$$V[\overline{X}] = \frac{V[X]}{n}$$` How could we show/test that it is true? -- <br> 1. Explain a procedure in words first 1. Implement it in `R` (problem set) --- class: inverse, middle, center # Coding detour: iteration --- # Iteration We have learned how to take a sample of size `\(n\)` from a distribution/population: ```r n <- 5 x <- c(1,2,5) sample(x, size = n, replace = T) ``` ``` ## [1] 5 5 5 1 2 ``` ```r sample(x, size = n, replace = T, prob = c(.5, .3, .2)) ``` ``` ## [1] 5 1 5 1 1 ``` ```r rnorm(n, mean = 0, sd = 1) ``` ``` ## [1] -0.2762190 0.5593851 0.1070105 -0.2432412 -0.8490575 ``` -- But how do we do this `\(m\)` times? (e.g. `\(m = 10\)` times) --- # Iteration (1) Easiest way (base `R`): `replicate()` ```r n <- 5 m <- 10 x <- c(1,2,5) sample(x, size = n, replace = T) ``` ``` ## [1] 1 5 5 1 2 ``` ```r replicate(m, sample(x, size = n, replace = T)) ``` ``` ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] ## [1,] 1 1 5 2 2 5 2 1 5 2 ## [2,] 5 2 1 5 2 2 5 2 1 1 ## [3,] 1 5 1 1 2 5 5 1 5 5 ## [4,] 2 1 2 1 1 1 2 1 5 2 ## [5,] 1 5 2 1 5 5 5 1 5 1 ``` ```r set.seed(60637) replicate(m, mean(sample(x, size = n, replace = T))) ``` ``` ## [1] 3.0 4.2 2.8 2.6 2.0 4.4 2.2 3.4 2.8 3.6 ``` --- # Iteration (2) Another way: `for`-loops. 
```r
(sample_means <- rep(NA, m))
```

```
## [1] NA NA NA NA NA NA NA NA NA NA
```

```r
set.seed(60637)
for(i in 1:m){
  sample_means[i] <- mean(sample(x, size = n, replace = T))
}
sample_means
```

```
## [1] 3.0 4.2 2.8 2.6 2.0 4.4 2.2 3.4 2.8 3.6
```

---

# Iteration (3)

The "tidy" way: `map`

```r
set.seed(60637)
map_dbl(1:10, ~ mean(sample(x, size = n, replace = T)))
```

```
## [1] 3.0 4.2 2.8 2.6 2.0 4.4 2.2 3.4 2.8 3.6
```

--

`map` allows you to pass vectors of arguments:

```r
map_dbl(c(1,3,5), ~ mean(rnorm(n = 50, mean = .x, sd = 1)))
```

```
## [1] 0.8462248 3.0035396 4.9732586
```

```r
map2_dbl(c(1,3,5), c(5, 30, 10000), ~ mean(rnorm(n = .y, mean = .x, sd = 1)))
```

```
## [1] 0.8340716 3.0117169 5.0058276
```

<!-- # tibble(n = c(5, 30, 10000), mean = c(1,3,5), sd = c(5, 2, 1)) %>% -->
<!-- #   mutate(samp = pmap(., rnorm), -->
<!-- #          samp_mean = map_dbl(samp, ~mean(.x))) -->

---

class: inverse, middle, center

# Asymptotic properties of the sample mean

---

# Weak law of large numbers

We noted that `\(E[\overline{X}] = E[X]\)`: the sample mean is an unbiased estimator of the expected value.

--

WLLN says: the larger the sample size `\(n\)`, the closer `\(\overline{X}\)` gets to `\(E[X]\)`:

**Theorem 3.2.8** Let `\(X_1, X_2, \ldots, X_n\)` be i.i.d. random variables with finite variance `\(\text{V}[X] > 0\)`, and let `\(\overline{X}_{(n)} = \frac{1}{n} \sum_{i=1}^n X_i\)`. Then

`$$\overline{X}_{(n)} \overset{p}{\to} \text{E}[X]$$`

--

In words, the larger the sample size, the closer the sample mean `\(\overline{X}\)` should get to the expected value `\(E[X]\)`.

--

Kind of boring/obvious.

---

# Repeated sample means

Define random variable `\(X\)`:

$$ f(x) = \begin{cases} 1/5 & x = 1 \\\ 2/5 & x = 2 \\\ 1/5 & x = 3 \\\ 1/5 & x = 5 \\\ 0 & \text{otherwise} \end{cases} $$

Repeatedly draw a sample of size `\(n = 2\)` with replacement, record the sample mean. What will the distribution of these sample means look like?

---

# Repeated sample means (2)

<img src="data:image/png;base64,#slides_iqss_week_5_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

# Repeated sample means (3)

Now try `\(n = 1000\)`:

<img src="data:image/png;base64,#slides_iqss_week_5_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

# CLT

**Theorem 3.2.24** Central limit theorem

Let `\(X_1, X_2, \ldots, X_n\)` be i.i.d. random variables with finite `\(E[X] = \mu\)` and finite `\(V[X] = \sigma^2 > 0\)`. Then

$$ \frac{\sqrt{n} \left(\overline{X} - \mu\right)}{\sigma} \overset{d}{\to} N(0, 1)$$

--

This means that for a large enough `\(n\)`, `\(\overline{X}\)` should be distributed roughly `\(N(\mu, \sigma^2/n)\)`.

--

**Note**:

- `\(E[\overline{X}] = E[X]\)` and `\(V[\overline{X}] = V[X]/n\)` are true for any sample size.
- The rest of the CLT (i.e. normality) applies asymptotically (i.e. large enough `\(n\)`).

---

# Illustration using above example

<img src="data:image/png;base64,#slides_iqss_week_5_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

### CLT intuition: why is `\(\overline{X}\)` normally distributed, even when `\(X\)` is not?

`$$\begin{aligned} \overline{X} &= \frac{X_1 + X_2 + \ldots + X_n}{n}\\\ &= \frac{(E[X] + \epsilon_1) + (E[X] + \epsilon_2) + \ldots + (E[X] + \epsilon_n)}{n} \\\ &= \frac{nE[X] + \sum_{i = 1}^n \epsilon_i}{n} \\\ &= E[X] + \frac{1}{n} \sum_{i = 1}^n \epsilon_i \\\ \end{aligned}$$`

where `\(\epsilon_i \equiv X_i - E[X]\)` ("deviations").
--

Note that `\(\text{E}[\epsilon_i] = \text{E}[X_i] - E[E[X]] = E[X] - E[X] = 0\)`.

--

And `\(E[\sum_{i = 1}^n \epsilon_i] = \sum_{i = 1}^n E[\epsilon_i] = 0\)`.

--

Given large enough `\(n\)`, `\(\frac{1}{n}\left|\sum_{i = 1}^n \epsilon_i\right|\)` should be close to zero, with larger values less likely. This is true regardless of the distribution of `\(X\)`.

---

# Combinatorics link

Suppose you are flipping a coin `\(1000\)` times. How many ways are there to get

- 0 heads?
- 1 head?
- 2 heads?
- 500 heads?

--

$${n \choose k} = \frac{n!}{k!(n - k)!} $$

--

In `R`:

```r
choose(n = 1000, k = 1000)
```

```
## [1] 1
```

```r
choose(n = 1000, k = 500)
```

```
## [1] 2.702882e+299
```

--

If normalized, this is the PMF for the RV "number of heads in `\(n\)` flips".

---

# Combinatorics link (2)

Compute the number of ways to get `\(k \in 0, 1, \ldots, 1000\)` heads in 1000 flips:

```r
ks <- 0:1000
n <- 1000
```

--

```r
# sapply
nways <- sapply(ks, choose, n = n)
head(nways, 4)
```

```
## [1]         1      1000    499500 166167000
```

--

```r
# for-loop
nways_2 <- rep(NA, 1001)
for(k in ks){
  nways_2[k+1] <- choose(k = k, n = n)
}
head(nways_2, 4)
```

```
## [1]         1      1000    499500 166167000
```

---

# Combinatorics link (3)

A "tidy" way to compute the number of ways to get `\(k \in 0, 1, \ldots, 1000\)` heads in 1000 flips:

```r
# map
dat <- tibble(k = ks) |>
  mutate(nways_3 = map_dbl(k, choose, n = n))
dat
```

```
## # A tibble: 1,001 × 2
##        k  nways_3
##    <int>    <dbl>
##  1     0  1   e 0
##  2     1  1   e 3
##  3     2  5.00e 5
##  4     3  1.66e 8
##  5     4  4.14e10
##  6     5  8.25e12
##  7     6  1.37e15
##  8     7  1.94e17
##  9     8  2.41e19
## 10     9  2.66e21
## # … with 991 more rows
```

---

# Combinatorics link (4)

```r
dat |>
  ggplot(aes(x = k, y = nways_3)) +
  geom_point()
```

<img src="data:image/png;base64,#slides_iqss_week_5_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

# Combinatorics link (5)

```r
dat |>
  mutate(prop = nways_3/sum(nways_3),
         dn = dnorm(x = k, mean = 500, sd = sqrt(1000*.5^2))) |>
  filter(k %in% 400:600) |>
  ggplot(aes(x = k, y = prop)) +
  geom_point() +
  geom_line(aes(y = dn), col = "red")
```

<img src="data:image/png;base64,#slides_iqss_week_5_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---

# Combinatorics link: the bottom line

When flipping 1000 coins, it is possible to get any number of heads between 0 and 1000, but there are more ways to get around 500 heads, so it is more likely.

--

How much more likely? Combinatorics and the normal distribution say the same thing.

--

Similarly, when computing a sample mean of `\(X\)`, many values are possible, but there are more ways to get something close to `\(E[X]\)`. The CLT says that, given large enough `\(n\)`, the normal distribution tells you how much more likely.

--

Fundamentally, it's because independent random deviations usually roughly cancel out.

---

class: bg-full
background-image: url("data:image/png;base64,#assets/galton_board.png")
background-position: center
background-size: contain

# Galton board (quincunx)

???

Source: https://youtu.be/3m4bxse2JEQ?t=51

---

# CLT: big enough sample?

<img src="data:image/png;base64,#slides_iqss_week_5_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

---

# CLT: big enough sample?

<img src="data:image/png;base64,#slides_iqss_week_5_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

# CLT: big enough sample?
<img src="data:image/png;base64,#slides_iqss_week_5_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> --- # CLT: big enough sample? <img src="data:image/png;base64,#slides_iqss_week_5_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> --- class: inverse, middle, center # Plug-in principle --- # Plug-in principle **Plug-in principle**: "Write down the feature of the population that we are interested in, and then use the sample analog to estimate it" (A&M, page 116) - sample mean `\(\overline{X} = \frac{1}{n}\sum_i X_i\)` is sample analog of `\(\text{E}[X] = \sum_x x f(x)\)` - (plug-in) sample variance `\(\overline{X^2} - \overline{X}^2\)` is sample analog of `\(\text{V}[X] = E[X^2] - E[X]^2\)` - OLS estimates are sample analog of best (MSE-minimizing) linear predictor --- # Plug-in principle (2) **Plug-in principle**: "Write down the feature of the population that we are interested in, and then use the sample analog to estimate it" (A&M, page 116) -- - In theory - CDF `\(F(x)\)` contains all information about R.V., empirical CDF `\(\hat{F}(x)\)` contains all information about sample - empirical CDF `\(\hat{F}(x)\)` unbiased and consistent for `\(F(x)\)` - our estimands can be expressed as functions of the CDF `\(F(x)\)` - for "plug-in estimator", we use `\(\hat{F}(x)\)` instead -- - In practice - to define estimand, we use "population quantities" (e.g. `\(E[X]\)`) instead of CDF `\(F(x)\)` - to get "plug-in" estimate, we use sample quantities (e.g. `\(\overline{X}\)`) instead -- **Implication**: what we learned about how `\(\overline{X}\)` relates to `\(\text{E}[X]\)` applies to more complicated estimands --- # An application of the plug-in principle Suppose we want to calculate the variance of `\(X\)` from a sample. -- Recall the definition of variance: `$$\text{V}[X] = E[X^2] - E[X]^2$$` -- But we have only a sample: we don't know `\(E[X^2]\)` or `\(E[X]\)`. -- Plug-in principle: use the sample mean `\(\overline{X}\)` for `\(E[X]\)`, use `\(\overline{X^2}\)` for `\(E[X^2]\)`! -- `$$\hat{\text{V}}_{\text{plug-in}}[X] = \overline{X^2} - \overline{X}^2$$` -- How would you compute that, given a sample `x` as below? ```r x <- c(1,5,2,6,3,4,2) ``` ??? ```r mean(x^2) - mean(x)^2 ``` ``` ## [1] 2.77551 ``` --- # Bias of the plug-in sample variance estimator Recalling that `\(V[\overline{X}] = E[\overline{X}^2] - E[\overline{X}]^2\)`, `$$\begin{aligned} E[\overline{X^2} - \overline{X}^2] &= E[\overline{X^2}] - E[\overline{X}^2] \\\ &= E[X^2] - \left(E[\overline{X}]^2 + V[\overline{X}]\right) \\\ &= E[X^2] - E[X]^2 - \frac{V[X]}{n} \\\ &= \overbrace{V[X]}^{\text{target}} - \overbrace{\frac{V[X]}{n}}^{\text{variance of }\overline{X}} \\\ &= \frac{n - 1}{n} V[X] \end{aligned}$$` -- To show via simulation: -- - define RV `\(X\)` - draw `\(m\)` samples of size `\(n\)`, compute plug-in sample variance - compare average estimate to `\(V[X]\)` --- # Simulation <img src="data:image/png;base64,#slides_iqss_week_5_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" /> (See Rmd for code.) --- # Intuition The variance measures the average squared difference from the mean: `$$V[X] = E[(X - E[X])^2]$$` -- If we knew `\(E[X] = \mu\)`, then our plug-in sample variance estimator would be $$\frac{1}{n} \sum_{i = 1}^n (x_i - \mu)^2 $$ and this is unbiased (as in simulation above). -- But we don't know `\(E[X]\)`, so we estimate using `\(\overline{X}\)`. 
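--

A minimal simulation sketch (my own illustration, not from A&M; the distribution of `\(X\)`, `\(n\)`, and the number of replications are arbitrary choices) of what plugging in `\(\overline{X}\)` for `\(\mu\)` does to the variance estimate:

```r
set.seed(60637)
x_vals <- c(1, 2, 3, 5)          # support of X (the PMF from the "Repeated sample means" slide)
probs  <- c(1/5, 2/5, 1/5, 1/5)  # its probabilities
mu     <- sum(x_vals * probs)    # E[X]
n <- 5

# For each of 10,000 samples, compute the variance estimate two ways:
# with the (usually unknown) true mean, and plugging in the sample mean
ests <- replicate(10000, {
  samp <- sample(x_vals, size = n, replace = TRUE, prob = probs)
  c(known_mean = mean((samp - mu)^2),
    plug_in    = mean((samp - mean(samp))^2))
})
rowMeans(ests)                   # plug-in average is lower by about V[X]/n ...
sum(x_vals^2 * probs) - mu^2     # ... than V[X]
```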
-- If our sample is unrepresentative in one direction, so is `\(\overline{X}\)`, making the variance appear smaller (by `\(V[\overline{X}]\)`!). --- # Variances "So far, we have discussed the population variance `\(V[X]\)`, and an estimator thereof, namely, the unbiased sample variance, `\(\hat{V}[X]\)`. Additionally, in Section 3.2.1 (*Sample Means*), we showed that the sampling variance of the sample mean is `\(V[\overline{X}] = \frac{V[X]}{n}\)`. (Do not confuse these three quantities.)" (A&M, page 114) -- In illustrating `\(V[\overline{X}]\)` we assumed knowledge of the distribution of `\(X\)`. But how can we estimate `\(V[\overline{X}]\)` given a sample? (Also why would we do that?) -- Natural choice is `\(\hat{V}[X]/n\)`. <!-- Suppose our estimand is the ATE: --> <!-- `$$\text{ATE} \equiv E[Y_i(1) - Yi(0)] = E[Y_i(1)] - E[Yi(0)]$$` --> <!-- "Sample analogue" of `\(E[Y_i(1)]\)`: `\(\frac{\sum_i Y_i D_i }{\sum_i D_i}\)` --> <!-- "Sample analogue" of `\(E[Y_i(0)]\)`: `\(\frac{\sum_i Y_i (1 - D_i) }{\sum_i (1 - D_i)}\)` --> <!-- -- --> <!-- This estimator is unbiased if `\(E[Y_i | D_i = 1] = E[Y_i(1)]\)` and `\(E[Y_i | D_i = 0] = E[Y_i(0)]\)` (as in an experiment) but otherwise biased. --> <!-- -- --> <!-- In observational studies we can usually do better by "conditioning on" covariates. --> <!-- # Stability of normal distribution --> <!-- If `\(X\)` and `\(Y\)` are independent and drawn from a normal distribution, then `\(X+Y\)` is distributed normally. --> <!-- Illustration: --> <!-- ```{r echo = F} --> <!-- x <- rnorm(n = 10000, mean = 4, sd = 1) --> <!-- y <- rnorm(n = 10000, mean = -4, sd = 1.5) --> <!-- bind_rows(tibble(value = x, sample = "X"), --> <!-- tibble(value = y, sample = "Y"), --> <!-- tibble(value = x+y, sample = "X+Y")) |> --> <!-- ggplot(aes(x = value, fill = sample, col = sample)) + --> <!-- geom_histogram(bins = 50, position = "identity", alpha = .5) + --> <!-- geom_vline(xintercept = c(4, -4), lty = 2) + --> <!-- labs(fill = "", col = "") --> <!-- ``` --> <!-- # CI for the sample mean --> <!-- Suppose we have a RV `\(X\)` with `\(\text{V}[X] = 1\)`. --> <!-- ```{r} --> <!-- expand_grid(eps = c(.05, .075, .1, .125, .15), --> <!-- n = seq(1000, 100000, by = 1000)) |> # c(1000, 5000, 10000, 50000, 100000)) |> --> <!-- mutate(ub_pr_dev_gt_eps = 1/(n*eps^2)) |> --> <!-- ggplot(aes(x = n, y = ub_pr_dev_gt_eps, lty = factor(eps), group = factor(eps))) + --> <!-- geom_line() + --> <!-- scale_x_log10() + --> <!-- coord_cartesian(ylim = c(0, 1)) --> <!-- ``` --> <!-- --- --> <!-- # CLT: big enough sample? --> <!-- ```{r, echo = F, cache = T, fig.height = 6} --> <!-- expand_grid(n = c(3, 10, 20, 30, 100, 1000), --> <!-- rep = 1:10000) |> --> <!-- mutate(samp = map(n, rnorm, mean = 10, sd = 3)) -> dat --> <!-- plot_function(dat) --> <!-- ``` --> <!-- --- -->
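--

As a rough sketch (my own illustration, not from A&M; the distribution of `\(X\)`, `\(n\)`, and the number of replications are arbitrary choices), we can estimate `\(V[\overline{X}]\)` from a single sample with `\(\hat{V}[X]/n\)` and compare it to the variance of the sample mean across many simulated samples:

```r
set.seed(60637)
x_vals <- c(1, 2, 3, 5)          # support of X (illustrative)
probs  <- c(1/5, 2/5, 1/5, 1/5)  # its probabilities
n <- 100

# Estimate V[X-bar] from one sample: unbiased sample variance (var()) divided by n
samp <- sample(x_vals, size = n, replace = TRUE, prob = probs)
var(samp) / n

# Check: variance of the sample mean across many samples
sample_means <- replicate(10000,
                          mean(sample(x_vals, size = n, replace = TRUE, prob = probs)))
var(sample_means)
```

In practice we only have one sample, so the first number is the one we can actually compute; the second is just a simulation check.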